Extracts UTF-8 sequence validation to BitString class by mward-sudo · Pull Request #717 · bartblast/hologram

mward-sudo · 2026-02-18T23:39:21Z

Closes #713

Dependencies

Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.

Summary by CodeRabbit

New Features
- Added UTF-8 decoding and validation helpers to the core bitstring utilities.
Refactor
- Consolidated UTF-8 validation logic across unicode handlers to use shared utilities, reducing duplication.
Tests
- Added comprehensive UTF-8 decoding and validation tests covering multi-byte sequences, invalid bytes, overlong encodings, surrogates, and boundary cases.

…ds parameter validation

coderabbitai · 2026-02-18T23:39:35Z

Note

Reviews paused

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds four static UTF‑8 helpers to Bitstring and refactors assets/js/erlang/unicode.mjs to use them; also adds tests covering the new helpers and integration checks. Changes are mechanical replacement of inlined UTF‑8 decoding/validation with centralized routines.

Changes

Cohort / File(s)	Summary
Bitstring UTF‑8 helpers `assets/js/bitstring.mjs`	Added static methods: `decodeUtf8CodePoint(bytes, start, length)`, `isValidUtf8CodePoint(codePoint, encodingLength)`, `isValidUtf8ContinuationByte(byte)`, and `isValidUtf8Sequence(bytes, start, length)` for UTF‑8 decoding/validation.
Unicode module refactor `assets/js/erlang/unicode.mjs`	Replaced multiple inline UTF‑8 decoding/validation code paths with calls to the new `Bitstring` helpers across normalization/character-to-binary logic; adjusted truncation/flow to rely on centralized validation.
Tests for UTF‑8 helpers `test/javascript/bitstring_test.mjs`	Added extensive tests for the new helpers and added integration checks (valid/invalid sequences, continuation bytes, overlongs, surrogates, out‑of‑range). All additions are test-only.

Sequence Diagram(s)

(omitted — changes are internal helper extraction and local refactor; no multi‑component sequential flow warranted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Port :unicode.characters_to_nfc_binary/1 to JS #544: Centralizes UTF‑8 validation/decoding logic in unicode-related code paths; overlaps with this PR's extraction into Bitstring.
Extract UTF-8 sequence length detection to Bitstring class #706: Adds getUtf8SequenceLength() to Bitstring; complements the helpers introduced here.
Port :unicode.characters_to_nfkc_binary/1 to JS #551: Modifies UTF‑8 validation/decoding in assets/js/erlang/unicode.mjs; likely touches the same locations refactored in this PR.

Suggested reviewers

bartblast

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main change: extraction of UTF-8 sequence validation logic into the BitString class.
Linked Issues check	✅ Passed	All coding objectives from issue `#713` are met: UTF-8 validation logic centralized in BitString class with methods for decoding, validation, and sequence handling.
Out of Scope Changes check	✅ Passed	All changes are within scope: UTF-8 helper methods added to Bitstring, refactored unicode.mjs to use these helpers, and comprehensive test coverage added.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

assets/js/erlang/unicode.mjs (1)

325-340: Four identical findValidUtf8Length closures — extract to a shared module-level helper

After this PR all four normalization functions (characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, characters_to_nfkd_binary/1) define findValidUtf8Length with byte-for-byte identical bodies. Lifting it to a module-level function would eliminate the duplication.
♻️ Proposed extraction (to be placed before the `Erlang_Unicode` object)
+// Scans forward to find the longest valid UTF-8 prefix of `bytes`.
+// Returns the byte offset past the last valid sequence.
+// Time complexity: O(n).
+const findValidUtf8Length = (bytes) => {
+  let pos = 0;
+  while (pos < bytes.length) {
+    const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]);
+    if (seqLength === false || !Bitstring.isValidUtf8Sequence(bytes, pos, seqLength))
+      break;
+    pos += seqLength;
+  }
+  return pos;
+};
+
 const Erlang_Unicode = {
Then remove all four local findValidUtf8Length declarations and reference the shared one.
Also applies to: 579-594, 686-701, 792-806
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 325 - 340, Extract the duplicated
closure findValidUtf8Length into a single module-level helper placed before the
Erlang_Unicode object and remove the four local definitions inside
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 so they reference
the shared helper; keep the exact validation logic (use
Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence) and ensure
callers still pass the same bytes array so behavior is unchanged.

assets/js/bitstring.mjs (1)

259-259: Object literals allocated on every call in hot-path functions — consider static class fields

Both firstByteMasks (line 259) and minValueForLength (line 602) construct a new object on every invocation. These functions are called per-byte during UTF-8 validation, making them hot paths. Promoting these to private static class fields (arrays, indexed by sequence length) eliminates per-call allocation:

♻️ Proposed refactor — static array fields

 export default class Bitstring {
   static `#decoder` = ERTS.utf8Decoder;
   static `#encoder` = new TextEncoder("utf-8");
+  // Indexed by UTF-8 sequence length (1–4); unused positions are 0.
+  static `#UTF8_FIRST_BYTE_MASKS` = [0, 0, 0x1f, 0x0f, 0x07];
+  static `#UTF8_MIN_CODE_POINT`    = [0, 0, 0x80, 0x800, 0x10000];

   static decodeUtf8CodePoint(bytes, start, length) {
     if (length === 1) return bytes[start];
-
-    // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07
-    const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-
-    let codePoint = bytes[start] & firstByteMasks[length];
+    let codePoint = bytes[start] & $.#UTF8_FIRST_BYTE_MASKS[length];

   static isValidUtf8CodePoint(codePoint, encodingLength) {
-    const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-
-    if (codePoint < minValueForLength[encodingLength]) return false;
+    if (codePoint < $.#UTF8_MIN_CODE_POINT[encodingLength]) return false;

Also applies to: 602-602

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` at line 259, The object literals firstByteMasks and
minValueForLength are being reallocated on every hot-path call; move them out of
the functions and into private static class fields (e.g.,
BitString.#firstByteMasks and BitString.#minValueForLength) as arrays indexed by
sequence length, then replace uses of the local objects with lookups into these
static fields to avoid per-call allocations (update any functions referencing
firstByteMasks and minValueForLength to use the static field names).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 619-634: isValidUtf8Sequence currently omits verification that
bytes[start] is a valid UTF‑8 leader for the given length; update the function's
doc comment (above static isValidUtf8Sequence) to explicitly state the
precondition that the leader byte must be pre-validated (e.g., by
getUtf8SequenceLength) and that this routine only checks bounds, continuation
bytes via $.isValidUtf8ContinuationByte and codepoint validity via
$.decodeUtf8CodePoint/$.isValidUtf8CodePoint; alternatively, if you prefer full
self‑validation, add an explicit leader‑byte check against the expected leader
bit pattern for the provided length before the continuation checks.

In `@assets/js/erlang/unicode.mjs`:
- Line 687: In characters_to_nfkc_binary/1 update the inline comment "// scan
forward, validating each sequence" to match the capitalization used in the other
findValidUtf8Length blocks (e.g. change to "// Scan forward, validating each
sequence") so the comment casing is consistent with the other occurrences in the
findValidUtf8Length implementations.

---

Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Line 259: The object literals firstByteMasks and minValueForLength are being
reallocated on every hot-path call; move them out of the functions and into
private static class fields (e.g., BitString.#firstByteMasks and
BitString.#minValueForLength) as arrays indexed by sequence length, then replace
uses of the local objects with lookups into these static fields to avoid
per-call allocations (update any functions referencing firstByteMasks and
minValueForLength to use the static field names).

In `@assets/js/erlang/unicode.mjs`:
- Around line 325-340: Extract the duplicated closure findValidUtf8Length into a
single module-level helper placed before the Erlang_Unicode object and remove
the four local definitions inside characters_to_nfc_binary/1,
characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and
characters_to_nfkd_binary/1 so they reference the shared helper; keep the exact
validation logic (use Bitstring.getUtf8SequenceLength and
Bitstring.isValidUtf8Sequence) and ensure callers still pass the same bytes
array so behavior is unchanged.

assets/js/bitstring.mjs

assets/js/erlang/unicode.mjs

coderabbitai

🧹 Nitpick comments (2)

assets/js/bitstring.mjs (1)

255-269: Per-call object literal in a hot-path inner function.

firstByteMasks is re-allocated on every invocation of decodeUtf8CodePoint, which is called from within the tight while loop in findValidUtf8Length across every unicode function. Hoisting it (or isValidUtf8CodePoint's minValueForLength, see below) to a module-level constant eliminates per-call heap allocation.

♻️ Proposed refactor: hoist lookup table to module scope

+// Pre-allocated lookup tables for UTF-8 decoding/validation (avoid per-call allocation)
+const _UTF8_FIRST_BYTE_MASKS = [0, 0xff, 0x1f, 0x0f, 0x07];
+const _UTF8_MIN_CODE_POINT = [0, 0, 0x80, 0x800, 0x10000];

 static decodeUtf8CodePoint(bytes, start, length) {
   if (length === 1) return bytes[start];

-  // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07
-  const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-
-  let codePoint = bytes[start] & firstByteMasks[length];
+  let codePoint = bytes[start] & _UTF8_FIRST_BYTE_MASKS[length];

   for (let i = 1; i < length; i++) {
     codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
   }

   return codePoint;
 }

 static isValidUtf8CodePoint(codePoint, encodingLength) {
-  const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-
-  if (codePoint < minValueForLength[encodingLength]) return false;
+  if (codePoint < _UTF8_MIN_CODE_POINT[encodingLength]) return false;
   if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false;
   if (codePoint > 0x10ffff) return false;
   return true;
 }

Place the two const declarations at the top of the module (outside the class), alongside the existing private class fields area, to keep them co-located with the methods that use them.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 255 - 269, Hoist the per-call lookup
tables into module-level constants to avoid allocating them on every
decodeUtf8CodePoint invocation: move the object firstByteMasks (used in
decodeUtf8CodePoint) — and likewise minValueForLength used by
isValidUtf8CodePoint if present — out of the function and define them once at
module scope, then update decodeUtf8CodePoint and isValidUtf8CodePoint to
reference those constants instead of creating new objects each call.

assets/js/erlang/unicode.mjs (1)

326-339: Four identical findValidUtf8Length bodies — consider extracting to Bitstring.

Now that the body is entirely composed of Bitstring static calls, the implementations in characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 are byte-for-byte identical. A single Bitstring.findValidUtf8PrefixLength(bytes) helper would eliminate the duplication and be consistent with the PR's overall goal of centralising UTF-8 logic.

♻️ Sketch of the extraction

In assets/js/bitstring.mjs:

+  // Scans bytes forward and returns the length of the longest valid UTF-8 prefix.
+  static findValidUtf8PrefixLength(bytes) {
+    let pos = 0;
+    while (pos < bytes.length) {
+      const seqLength = $.getUtf8SequenceLength(bytes[pos]);
+      if (seqLength === false || !$.isValidUtf8Sequence(bytes, pos, seqLength))
+        break;
+      pos += seqLength;
+    }
+    return pos;
+  }

Then each findValidUtf8Length local inside the NFC/NFD/NFKC/NFKD functions collapses to:

-    const findValidUtf8Length = (bytes) => {
-      // Scan forward, validating each sequence
-      let pos = 0;
-      while (pos < bytes.length) {
-        const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]);
-        if (
-          seqLength === false ||
-          !Bitstring.isValidUtf8Sequence(bytes, pos, seqLength)
-        )
-          break;
-        pos += seqLength;
-      }
-      return pos;
-    };
+    const findValidUtf8Length = Bitstring.findValidUtf8PrefixLength;

Also applies to: 583-594, 690-701, 795-806

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 326 - 339, Extract the repeated
UTF-8 prefix scanning logic into a new
Bitstring.findValidUtf8PrefixLength(bytes) helper that loops using
Bitstring.getUtf8SequenceLength(byte) and Bitstring.isValidUtf8Sequence(bytes,
pos, seqLength) and returns the valid prefix length; add this helper to
Bitstring (e.g., assets/js/bitstring.mjs) and replace the inline loops found in
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 (and the other
occurrences around the indicated ranges) with a single call to
Bitstring.findValidUtf8PrefixLength(bytes).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@assets/js/erlang/unicode.mjs`:
- Line 687: Normalize the comment casing for consistency by changing the inline
comment "// scan forward, validating each sequence" to use sentence-style
capitalization (e.g., "// Scan forward, validating each sequence") wherever this
pattern appears in the file; update the specific comment near the Unicode
validation logic in unicode.mjs (the comment currently reading "scan forward,
validating each sequence") so it matches the project's comment casing standard.

---

Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 255-269: Hoist the per-call lookup tables into module-level
constants to avoid allocating them on every decodeUtf8CodePoint invocation: move
the object firstByteMasks (used in decodeUtf8CodePoint) — and likewise
minValueForLength used by isValidUtf8CodePoint if present — out of the function
and define them once at module scope, then update decodeUtf8CodePoint and
isValidUtf8CodePoint to reference those constants instead of creating new
objects each call.

In `@assets/js/erlang/unicode.mjs`:
- Around line 326-339: Extract the repeated UTF-8 prefix scanning logic into a
new Bitstring.findValidUtf8PrefixLength(bytes) helper that loops using
Bitstring.getUtf8SequenceLength(byte) and Bitstring.isValidUtf8Sequence(bytes,
pos, seqLength) and returns the valid prefix length; add this helper to
Bitstring (e.g., assets/js/bitstring.mjs) and replace the inline loops found in
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 (and the other
occurrences around the indicated ranges) with a single call to
Bitstring.findValidUtf8PrefixLength(bytes).

mward-sudo · 2026-02-19T00:29:01Z

@coderabbitai Don't do any further reviews on this PR while it is in draft.

coderabbitai · 2026-02-19T00:29:21Z

@mward-sudo: Understood! I won't perform any further reviews on this PR while it remains in draft status. Once you're ready for a review, you can ask me to resume or change the PR status to "Ready for review."

✅ Actions performed

Reviews paused.

mward-sudo added 3 commits February 18, 2026 22:39

Extracts UTF-8 code point decoding to the BitString utility class, ad…

9a49bec

…ds parameter validation

Extracts UTF-8 continuation byte validation to BitString class

105d2d9

Extracts Utf-8 code point validation to BitString class

7015625

This was referenced Feb 18, 2026

Extracts truncated UTF-8 sequence validation to BitString class #718

Draft

Extracts UTF-8 valid length position to Bitstring class #719

Draft

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

assets/js/bitstring.mjs Outdated Show resolved Hide resolved

assets/js/erlang/unicode.mjs Show resolved Hide resolved

Extracts UTF-8 sequence validation to BitString class

b9bfe2d

mward-sudo force-pushed the 02-18-extracts_utf-8_sequence_validation_to_bitstring_class branch from 073775f to b9bfe2d Compare February 18, 2026 23:50

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

This was referenced Feb 19, 2026

Extracts BitString to code points array to BitString class #721

Draft

Extracts BitString from code point to BitString class #723

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Comments

Extracts UTF-8 sequence validation to BitString class#717

Extracts UTF-8 sequence validation to BitString class#717
mward-sudo wants to merge 4 commits intobartblast:devfrom
mward-sudo:02-18-extracts_utf-8_sequence_validation_to_bitstring_class

mward-sudo commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 18, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

mward-sudo commented Feb 19, 2026

Uh oh!

coderabbitai bot commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Comments

Conversation

mward-sudo commented Feb 18, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependencies

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

mward-sudo commented Feb 19, 2026

Uh oh!

coderabbitai bot commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mward-sudo commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 18, 2026 •

edited

Loading